1.  Introduction

Very large scale integrated (VLSI) circuit technology has made it possible to build
multiprocessor hardware devices to aid in the rapid solution of sophisticated problems.
An algorithms designer wishing to take full advantage of the massive parallelism offered
by VLSI must address geometric issues hitherto relegated to layout artists. The reason
for this is that VLSI is a planar technology in which the interconnections among
components on a chip may cost more than the components themselves. The designer of a
multiprocessor algorithm to be implemented in this technology must consider the
complexity of the data paths between processors in evaluating the algorithm.

Many programming applications require the ability to insert records into a set, and at
any time to retrieve from the set the record having the smallest key according to some
ordering. A data structure that provides such services is called a priority queue. (See
Knuth [1973] pp. 150-152 and Aho, Hopcroft, and Ullman [1974] pp. 147-152.) The
operation INSERT(Q,a) replaces the set Q with the set Q ∪ {a}. The operation
EXTRACT_MIN(Q) returns the smallest element a of Q and replaces Q with Q - {a}. This
paper shows how high-performance priority queues can be built using the VLSI
technology.
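
For reference, the two operations can be modelled sequentially. The following is a
minimal sketch that uses Python's standard heapq module as a stand-in; the class name
ReferencePriorityQueue is hypothetical, and nothing in it reflects the hardware designs
discussed later in this paper.

```python
import heapq


class ReferencePriorityQueue:
    """Sequential reference model of INSERT and EXTRACT_MIN (illustrative only)."""

    def __init__(self):
        self._heap = []

    def insert(self, a):
        # Q <- Q ∪ {a}
        heapq.heappush(self._heap, a)

    def extract_min(self):
        # Return the smallest element a of Q and set Q <- Q - {a}.
        return heapq.heappop(self._heap)


q = ReferencePriorityQueue()
for key in (9, 15, 4):
    q.insert(key)
smallest = q.extract_min()
```

On a sequential machine each of these calls costs time logarithmic in the size of the
queue, which is the cost the devices described below aim to hide from the host.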

Section  2  of  this  paper  discusses  systolic  systems,  the  model  of  parallel  computation 
used  for  this  work.  Section  3  presents  a  systolic  array  implementation  of  a  priority 
queue.  Section  4  shows  how  multiple  priority  queues  can  be  implemented  as  a  single 
device  that  shares  processors  among  the  queues.  The  organization  of  the  shared 
structure is presented in Section 5. Section 6 deals with the geometric layout of the
multiple queue device in VLSI. The conclusion is presented in Section 7.

2.  Systolic  Systems 

A systolic system is a network of processors that rhythmically compute and pass data
among themselves. The analogy is to the rhythmic contraction of the heart which pulses
blood through the circulatory system of the body. Each processor in a systolic network
can be thought of as a heart that pumps multiple streams of data through itself. The
regular beating of these parallel processors keeps up a constant flow of data throughout
the entire network. As a processor pumps data items through, it performs some
constant-time computation and may update some of the items.

Systolic systems provide a realistic model of computation which captures the concepts
of pipelining, parallelism, and interconnection structures. Kung and Leiserson [1978]
demonstrates that many basic matrix computations can be performed by systolic systems
whose underlying network is array structured. These systolic arrays are suitable for
implementation as VLSI hardware devices. This paper will show the utility of systolic
trees.

Unlike the closed-loop circulatory system of the body, a systolic computing system
usually has ports into which inputs flow, and ports from which the results are retrieved.
Thus a systolic system can be a pipelined system - input and output occur with every
pulsation. This makes them attractive as peripheral processors attached to the data
channel of a host computer. Figure 1 illustrates how a special-purpose systolic device
might form a part of a PDP-11 system. A systolic system might be attached directly to
the CPU of a Von Neumann machine, much as a floating-point processor may be added to
extend the instruction set of a computer.


Figure 1: A systolic device connected to the UNIBUS of a PDP-11.

The activities of the processors in a systolic system can be assumed to be
synchronous. With each pulse of a clock, a processor executes the same constant-time
program. Furthermore, each processor is only allowed a fixed number of input and
output lines and a constant amount of local storage. It is possible to view the processors
as being asynchronous, each computing its output values when all its inputs are available,
as in the data flow model. For the results of this paper, the synchronous approach is
more direct and intuitive.

3.  A  Simple  Systolic  Priority  Queue

A linear systolic array can implement a fast priority queue. Each processor in this
array has two registers A and B, and each processor can access the registers of its two
neighbors, as shown in Figure 2. The A registers hold elements in the queue in sorted
order, with the smallest element in A_1. The B registers contain elements that are being
inserted into the queue. Initially, all the elements in the queue are +∞. The priority
queue operations INSERT and EXTRACT_MIN are performed by the user at the left end in
the diagram. As items are inserted by the host, they displace overflow elements which
are output at the right end. Normally, the overflow element will be +∞, but when this is
not the case, a real overflow has occurred.


Figure 2: A simple systolic priority queue. (Labels: Host Computer, Overflow.)

Even  and  odd  numbered  processors  alternately  pulsate,  each  time  executing  the 
following: 

1.  B_i ← B_{i-1}.

2.  Arrange the elements in A_{i-1}, A_i, and B_i so that A_{i-1} ≤ A_i ≤ B_i.

Processor 0 is a dummy processor which does not execute any code, but whose registers
can be altered by the host machine. The array pulses twice each time an operation is
performed by the host machine, once for odd numbered processors and once for those
with even positions. The operation INSERT is implemented by placing the item a in
B_0 and -∞ in A_0 just before processor 1 pulses. Each element travels to the right until
it finds its place in the array.

By loading A_0 and B_0 with +∞, the pulsation of the systolic array causes
EXTRACT_MIN to be performed, the minimum value being found in A_0. With each pair of
pulsations the systolic array is ready to execute another INSERT or EXTRACT_MIN
operation. Figure 3 shows several steps in the execution of this systolic array. The
initial configuration in the figure shows the insertion of items 9 and 15 already in
progress. Although it may take an element a long time to find its place in the systolic
array, to the host computer an INSERT operation appears to take only constant time.
Since the minimum element in the queue is always at the front, an EXTRACT_MIN
operation also appears to take constant time. The operation of the systolic array is
pipelined so that no degradation occurs even when the host executes many priority
requests in a row. Thus we may say that the systolic array has a response time which is
a constant, independent of the length of the array.

Figure 3: Several steps in the execution of the systolic array shown in Figure 2.
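
The behavior of the array can be checked with a small software simulation. The sketch
below is illustrative only: the class and method names are hypothetical, Python's
math.inf stands in for +∞ and -∞, and the overflow output at the right end is not
modelled. Processors that pulse together touch disjoint registers, so a sequential loop
is assumed to reproduce one parallel pulse.

```python
import math

INF = math.inf


class SystolicPriorityQueue:
    """Software simulation of the linear systolic array (illustrative only)."""

    def __init__(self, n):
        # Registers A[0..n] and B[0..n]; index 0 is the dummy processor.
        self.n = n
        self.A = [INF] * (n + 1)
        self.B = [INF] * (n + 1)

    def _pulse(self, start):
        # Processors start, start+2, ... pulse together.
        for i in range(start, self.n + 1, 2):
            # Step 1: take the element in transit from the left neighbor.
            self.B[i] = self.B[i - 1]
            # Step 2: arrange so that A[i-1] <= A[i] <= B[i].
            self.A[i - 1], self.A[i], self.B[i] = sorted(
                (self.A[i - 1], self.A[i], self.B[i]))

    def _operate(self, a0, b0):
        self.A[0], self.B[0] = a0, b0  # host loads the dummy processor
        self._pulse(1)                 # odd numbered processors
        self._pulse(2)                 # even numbered processors

    def insert(self, a):
        self._operate(-INF, a)

    def extract_min(self):
        self._operate(INF, INF)
        return self.A[0]               # INF means the queue was empty


q = SystolicPriorityQueue(8)
for key in (9, 15, 4, 7):
    q.insert(key)
first = q.extract_min()
```

Each host operation costs exactly two pulses regardless of n, even though an inserted
element may still be travelling to the right when the next operation begins.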

4.  The  Systolic  Multiqueue 

Suppose  several  of  the  simple  priority  queues  in  Section  3  are  attached  as  a  device  to 
a  host  computer.  No  matter  how  a  fixed  number  of  processors  are  allocated,  the  capacity 
of  any  particular  queue  may  be  exceeded  while  most  of  the  other  queues  are  empty.  In 
this  section  a  single  device  is  presented  that  is  capable  of  implementing  many  priority 
queues  that  dynamically  share  processors.  Like  the  simple  queue  in  the  previous  section, 
the  systolic  multiqueue  can  perform  INSERT  and  EXTRACT_MIN  for  a  single  host 
computer,  on  any  of  m  queues,  with  a  response  time  that  is  a  constant,  independent  of 
the  size  of  the  queue. 

Figure 4 illustrates the organization of the systolic multiqueue. Each of the m queues
to be implemented requires a systolic array of the type presented in Section 3. These
can be accessed directly by the host computer. When a systolic array overflows, the
overflow element travels through a switching network to a large systolic tree. Each time
the minimum of a particular queue is extracted from the corresponding systolic array, the
minimum of the elements that have overflowed from that queue is removed from the
systolic tree. The internal structure of this shared overflow area is examined in Section
5. Here, we only need to know its behavior.

The records stored in the systolic tree are the same as those in the systolic arrays,
but an additional field is used to identify the queue from which the item originated. Thus
items are stored in the systolic tree according to a composite record <Q,a> where a was
originally inserted by an INSERT(Q,a) operation and eventually overflowed from the
systolic array corresponding to that queue.


Figure 4: The systolic multiqueue device. (Label: Switching Network.)


The operations that can be performed by the systolic tree are very similar to those
commands given to the entire systolic multiqueue by the host. The composite record
<Q,a> is inserted into the systolic tree by INSERT(Q,a), and EXTRACT_MIN(Q) removes the
smallest element in the systolic tree which has Q as its first field. As will be seen in
Section 5, a systolic tree of size n can perform each operation in time O(log n). The
operations are pipelined, however, so that the systolic tree can process several
operations in parallel, waiting only constant time between successive operations. For
example, if a sequence of EXTRACT_MIN's are started constant time apart, it will take
O(log n) for the first minimum to be retrieved but then results appear with every cycle
of the systolic tree. Thus the systolic tree provides high throughput, with O(log n)
response.
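
The distinction between response time and throughput can be made concrete with a toy
timing model. This is only a sketch under assumed numbers: the function name and the
10-cycle latency are hypothetical, chosen to illustrate a tree whose response time is
some fixed number of cycles.

```python
def completion_times(num_ops, latency, interval=1):
    """Cycle at which each pipelined operation's result appears.

    Operation k is issued at cycle k * interval and its result emerges
    `latency` cycles later.
    """
    return [k * interval + latency for k in range(num_ops)]


# Hypothetical numbers: five EXTRACT_MIN's issued one cycle apart into a
# tree whose response time is 10 cycles.
times = completion_times(num_ops=5, latency=10)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
```

The first result is delayed by the full latency, but every gap between successive
results equals the issue interval, which is the sense in which the tree offers high
throughput despite its logarithmic response.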

The claim was made, however, that the systolic multiqueue had constant response time
for each of m queues. Systolic arrays of size proportional to log n + log m are used to
achieve this goal by satisfying any immediate requests from the host. When the host
executes EXTRACT_MIN(Q), that operation is performed on the corresponding systolic
array. At the same time, a request is put into the systolic tree to perform an
EXTRACT_MIN(Q). A result is yielded by the systolic tree O(log n) time later and takes
O(log m) more time to traverse the switching network. It is then inserted into the
systolic array at the same end the host computer uses. Even if the host has performed
log n + log m EXTRACT_MIN(Q) operations in the meantime, the quick response systolic
array has been able to satisfy the requests. Now if the host continues to perform
EXTRACT_MIN(Q)'s, a stream of results from the systolic tree will be inserted into the
array just in time to satisfy the requests. The systolic array will always have at least
one item in it because operations on the systolic tree are pipelined. It does not matter
whether or not the host accesses different queues. Since it can only access one queue
at a time, no systolic array will empty before the beginning of a stream of items from the
systolic tree has reached the systolic array.

The  number  of  processors  in  the  systolic  arrays  is  m  log  n.  If  the  size  of  the  systolic 
tree is doubled, this means only m more processors need be added to the systolic arrays.
The  amount  of  sharing  of  processors  among  the  m  queues  is  clearly  substantial. 
Furthermore,  the  systolic  multiqueue  will  not  overflow  until  the  shared  systolic  tree 
overflows.  In  fact,  overflow  of  the  systolic  tree  can  be  handled  "nicely"  as  will  be  seen 
in  Section  5. 

5.  Systolic  Trees 

It seems natural to use a tree-structured hardware device to achieve pipelined
performance with O(log n) response time for INSERT(Q,a) and EXTRACT_MIN(Q). After all,
a software implementation on a sequential machine can guarantee O(log n) performance
by using a height-balanced binary search tree. AVL trees, 2-3 trees, and B-trees are
popular data structures which have this performance. For a sequential implementation of
a single priority queue, a heap is an attractive data structure because heap storage can
be managed as easily as stack storage. (Aho, Hopcroft, and Ullman [1974] has a good
presentation of several of these techniques.) Unlike most programmed implementations of
priority queues, however, the parallel structure required by the systolic multiqueue
cannot use a separate data structure for each queue.

A  major  problem  in  the  design  of  a  hardware  search  tree  is  that  the  standard  balanced 
tree schemes do not map well onto a fixed interconnection structure. A sequential
algorithm  can  move  the  tree  pointers  to  maintain  the  balance  of  the  tree.  Data  usually 
remains  in  fixed  locations.  Since  the  "pointers"  in  a  hardware  tree  are  electrical  wires, 
data  must  be  moved  to  maintain  the  balance  of  the  tree. 

Because the systolic multiqueue requires the operations INSERT(Q,a) and
EXTRACT_MIN(Q), keys are considered to be from a composite record <Q,a>. A dummy
queue number +∞ is used to indicate an empty record. Records are compared by
lexicographic ordering, that is, <Q,a> < <Q',a'> if Q < Q' or if Q = Q' and key(a) < key(a').
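
A minimal sketch of this ordering, assuming records are represented as Python tuples
(queue number, key) and that math.inf encodes the dummy queue number +∞; the helper
name record and the sample values are hypothetical.

```python
import math

EMPTY_QUEUE = math.inf  # assumed encoding of the dummy queue number +∞


def record(queue, key):
    # Python tuples compare lexicographically: queue number first, then key.
    return (queue, key)


records = [record(2, 5), record(EMPTY_QUEUE, 0), record(1, 8),
           record(2, 3), record(1, 1)]
ordered = sorted(records)
```

Sorting groups the records by queue, orders each group by key, and pushes the empty
record to the far end, which is where the dummy queue number is meant to place it.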

It is useful to view all operations on the tree as occurring in pairs
[EXTRACT_MIN(Q'), INSERT(Q,a)]. Normally, the paired operation involves the dummy
queue +∞. For example, when an insertion is performed on an arbitrary queue, a +∞
record is deleted by EXTRACT_MIN(+∞). If the systolic tree overflows from too many
insertions, however, this exceptional condition can be handled by the operating system of
the host computer. The job using a particular queue can be disabled and the elements in
that queue can be removed. When an EXTRACT_MIN frees up some space, the elements
of that queue can be "swapped" back in by the paired INSERT. The analogy to a virtual
memory computer which has a swapping drum is a good one. Queues can be managed
just like any other operating system resource. A small amount of bookkeeping is
required to keep, for each queue, the number of items in the tree.

One scheme for implementing the paired operations is illustrated in Figure 5. Each
processor in a systolic array is also a leaf of a systolic tree. A processor P_i contains
one record <Q_i,a_i>. The tree serves to broadcast paired operations to the processors
and to retrieve the EXTRACT_MIN results from the processors. A paired operation
[EXTRACT_MIN(Q'), INSERT(Q,a)] will reach all the processors at the same time. Each
processor executes:

1.  Extraction. If Q_i = Q' and Q_{i-1} < Q', then send <Q_i,a_i> up the tree as the
result of EXTRACT_MIN(Q').

2.  Left shift. If Q_{i-1} ≥ Q', then shift <Q_i,a_i> left to P_{i-1}.

3.  Right shift. If <Q_i,a_i> > <Q,a>, then shift <Q_i,a_i> right to P_{i+1}.

4.  Insertion. If <Q_{i-1},a_{i-1}> ≤ <Q,a> < <Q_i,a_i>, then P_i gets <Q,a>.

Figure 5: The systolic array-tree.

During  the  first  step,  each  processor  checks  to  see  whether  it  contains  the  item  to  be 
extracted.  After  that  item  is  sent  on  its  way  up  the  tree,  the  elements  to  the  right  slide 
left  to  take  up  the  empty  slot.  Then  the  position  for  the  insertion  is  determined,  and  the 
elements  to  the  right  of  that  processor  slide  right  to  make  room.  Finally,  the  item  to  be 
inserted  is  placed  in  the  slot  left  for  it.  Naturally,  the  shifts  can  be  optimized  so  that 
those  elements  that  slide  both  left  and  right  do  not  actually  have  to  move. 
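
The net effect of one paired operation on the row of processors can be sketched
sequentially. The code below is a hedged illustration rather than the parallel
per-processor program: the function name, the tuple encoding of records, and the EMPTY
constant are all assumptions, and a Python list stands in for the processors.

```python
import bisect
import math

EMPTY = (math.inf, math.inf)  # assumed encoding of an empty record (queue +∞)


def paired_operation(cells, extract_queue, new_record):
    """Apply [EXTRACT_MIN(Q'), INSERT(Q,a)] to a lexicographically sorted list."""
    # Extraction: the first record whose queue number is Q' is that queue's minimum.
    pos = next((i for i, (q, _) in enumerate(cells) if q == extract_queue), None)
    if pos is None:
        raise LookupError("no record with the requested queue number")
    result = cells[pos]
    # Left shift: records to the right of the extracted one close the gap.
    rest = cells[:pos] + cells[pos + 1:]
    # Right shift and insertion: records greater than the new one move right
    # by one place, and the new record takes the slot that opens up.
    slot = bisect.bisect_right(rest, new_record)
    cells[:] = rest[:slot] + [new_record] + rest[slot:]
    return result


cells = [(1, 4), (1, 9), (2, 3), EMPTY, EMPTY]
freed = paired_operation(cells, math.inf, (2, 1))  # an insertion into queue 2
smallest = paired_operation(cells, 1, EMPTY)       # an extraction from queue 1
```

Because every operation removes one record and adds one, the number of occupied cells
never changes; an insertion consumes an empty record and an extraction supplies one.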

Whereas the array-tree keeps all the data at the leaves of the tree, the systolic tree
shown in Figure 6 keeps the data in the internal nodes. Consequently, the structure is
more like a standard search tree. The processor at each node holds two records, and
has connections to its father and two sons. A depth-first tree traversal that prints the
left record, recursively visits the left son, recursively visits the right son, and then prints
the right record will print out the values in lexicographic order. There is a good reason
for having the pointers between the records rather than the normal search tree method
of a record between pointers. A balancing similar to the shift step in the systolic
array-tree can be performed top-down to permit pipelining of the paired operations.
Each processor need only look at itself and its two sons to determine the shift. The
topology of this tree is in some sense superior to the array-tree because there are
fewer connections. This will be examined more closely in the next section.
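
The traversal described above can be sketched as follows. The Node class, its field
names, and the hand-built example tree are hypothetical; the example assumes a tree
already arranged so that each node's left record precedes, and its right record
follows, everything stored beneath it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Record = Tuple[float, float]  # (queue number, key)


@dataclass
class Node:
    left_record: Record
    right_record: Record
    left_son: Optional["Node"] = None
    right_son: Optional["Node"] = None


def traverse(node: Optional[Node], out: List[Record]) -> List[Record]:
    """Left record, left son, right son, then right record."""
    if node is None:
        return out
    out.append(node.left_record)
    traverse(node.left_son, out)
    traverse(node.right_son, out)
    out.append(node.right_record)
    return out


# A small hand-built tree that satisfies the assumed ordering.
tree = Node((1, 2), (3, 9),
            Node((1, 5), (2, 1)),
            Node((2, 4), (3, 7)))
order = traverse(tree, [])
```

In this arrangement the two records of a node bracket its whole subtree, so the
smallest record of the tree sits in the left slot of the root.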


Figure 6: The systolic search tree.


6.  VLSI  Geometry  of  the  Systolic  Multiqueue 

Simple and regular interconnections in a VLSI design lead to cheap implementations and
high densities. Communication is costly in VLSI, and as the technology improves, the time
and energy required for communication grow in comparison with that needed for
processing. Therefore, the geometry of the systolic multiqueue must be considered in
evaluating the cost of a VLSI implementation.

The linearly connected systolic array easily satisfies the requirement of having a
simple geometric realization. The number of external data paths is small as well,
emanating only from the ends of the structure. As was shown in Kung and Leiserson
[1978], linearly connected systolic arrays are ideal for implementation in VLSI.

More interesting are the structures of the systolic binary trees. Mead and Rem [1978]
substantiates the assumption that communicating information from the leaf of a VLSI tree
to the root takes time proportional to the height of the tree. The fact that the root is
the only off-chip connection is highly desirable for VLSI where the number of pins on an
IC package is a severe constraint.

Whereas the systolic search tree presented in Section 5 can use either geometry,
routing the linear connections in Figure 8 appears to be more complex. The tree part of
the array-tree is used only for broadcasting, however, and a linear ordering of the
leaves need not be the natural ordering shown in Figure 5. This makes the problem
simpler, and leads to the linear area geometry of Figure 9.


Figure 9: The systolic array-tree embedded in linear area.

There is an advantage to the geometry shown in Figure 7 over that in Figure 8. The
linear area solution does not permit connections between the leaves of the tree and the
edge of the chip. Although the systolic tree in the systolic multiqueue does not need
connections to the leaves, the O(n log n) area embedding can be used to make a chip that
will permit a larger systolic tree to be built up from several chips based on the linear
area embedding. Figure 10 shows how this might be done. The decomposition can be
very efficient since the number of linear area chips dwarfs the number of O(n log n) area
chips.

7.  Conclusion 

The systolic multiqueue can be attached to a traditional computer system just like any
other device. Because each operation on a queue takes constant time, however, it is
reasonable to connect it directly to the CPU, and make the device visible to the user by
extending the instruction set of the computer. Thus the INSERT operation might be a
three operand instruction taking a queue number, a key, and a pointer to data. In a
multiprogramming or timesharing environment there might be many users of the systolic
multiqueue at the same time.

Figure 10: A large systolic tree as several VLSI chips. (Legend: O(n log n) area chip
with external connections at leaves; linear area chip with external connection at root
only.)

A priority queue is not an obscure data structure, and the uses of priority queues are
many. The computation time of sorting alone is sufficient to justify the systolic
multiqueue. Internal sorting normally takes O(n log n) time, but using the systolic
multiqueue, we save a factor of log n. Just insert the n items into one of the queues, and
then execute n EXTRACT_MIN's. Not only does the computation take less time, but the
load on the CPU of the host machine is lessened. External sorts frequently use priority
queues. For instance, the popular replacement selection algorithm (Knuth [1973], pp.
251-256) has a priority queue as its primary data structure.

Many types of search can be speeded up by utilizing fast priority queues. The A*
algorithm (Nilsson [1971] pp. 57-65), for instance, chooses the best of many possible
alternatives at each stage of the search. Many game playing programs using alpha-beta
search sort the moves at each level in the game tree to increase the number of cut-offs.
As the program searches deeper and deeper, the combinatorial explosion makes this very
expensive, and therefore the sort is frequently abandoned at greater search depths.
This need not be the case if the computer system has a systolic multiqueue.

In relational databases, the join operation is frequently implemented by sorting on the
chosen fields of two relations and then performing a merge. Algorithms for finding the
minimum spanning tree or convex hull of a set of points in a plane can use the systolic
multiqueue. A priority queue is also useful for hidden line elimination. Priority queues
are used in operating systems for resource management.

The  systolic  multiqueue  provides  insight  into  the  organization  of  special-purpose 
multiprocessor  devices  with  an  emphasis  on  a  VLSI  implementation.  The  sharing  of 
processors  by  several  independent  hardware  structures  is  a  key  issue.  A  tree  may  not 
be  the  optimal  shared  structure  for  a  given  set  of  constraints.  For  example,  a  systolic 
array  structure  that  performs  like  a  Young  tableau  (Knuth  [1973],  pp.  48-72)  might  be 
better under certain conditions, although asymptotically the number of processors
dedicated to a single queue grows as the square root of the number of shared
processors  rather  than  the  logarithm. 

The systolic multiqueue can be optimized and modified in many ways. For instance, it
is easy to convert the systolic multiqueue into a systolic multideque that implements the
priority deque operations INSERT, EXTRACT_MIN, and EXTRACT_MAX. Sten Andler has
observed that because only one systolic array in the systolic multiqueue need operate at
a time, the m systolic arrays of length O(log n) might be implemented using only O(log n)
processors each having enough memory to hold m items. Various modifications can be
made to the broadcast tree in the systolic array-tree and to the switching network in the
systolic multiqueue.


Advances in microelectronics have made the realization of "smart" data structures a
practical reality. VLSI gives us the capability of building logic-in-memory hardware that
will drastically change how things are computed. Models of computation based solely on
the Von Neumann architecture will be insufficient to evaluate algorithms. Multiprocessor
devices like the systolic multiqueue will introduce new cost functions to the sequential
algorithm designer. But much work must be done to define and examine the models of
parallel computation that lie between the mathematical world of computable functions and
the physical world of space and time.